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o Abstract 

All models are wrong and sometimes even the best of a set of models is 
useless. Modern phylogenetic comparative methods (PCMs) are almost ex- 
clusively model-based and therefore making robust inferences from PCMs 
requires using a model of trait evolution that is a good explanation for 

25 the data. To date, researchers using PCMs have evaluated the explana- 
tory power of a model only in terms of relative, not absolute, fit. Here we 
develop a general statistical framework for assessing the absolute fit, or ad- 
equacy, of phylogenetic models for the evolution of quantitative traits. We 
use our approach to test whether commonly used models are adequate de- 

30 scriptors of the macroevolutionary dynamics of real comparative data. We 
fit models of trait evolution to 337 comparative datasets covering three key 
Angiosperm functional traits and evaluated the absolute fit of the models 
to each dataset. Overall, the models we used are very inadequate for the 
evolution of these traits; this was true for many different groups and at 

35 many different scales. Furthermore, the relative support for a model had 
very little to do with its absolute adequacy. We argue that assessing model 
adequacy should be a key step in comparative analyses. 
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There are known knowns; there are things we know we know. 
We also know there are known unknowns; that is to say we 
4 o know there are some things we do not know. But there are also 

unknown unknowns — the ones we don't know we don't know. 

— Former U.S. Secretary of Defense, Donald Rumsfeld 

Introduction 

Phylogenetic trees and phylogenetic thinking are ubiquitous in modern 

45 biology; ecologists, geneticists, paleontologists and anthropologists now 
widely recognize that a historical perspective can provide important in- 
sights into evolutionary questions (Pennell and Harmon, 2013). This in- 
terest in phylogenetic biology has risen in lockstep with developments 
in phylogenetic comparative methods (PCMs; reviewed in O'Meara, 2012; 

50 Pennell and Harmon, 2013). Modern PCMs are almost exclusively based 
on probabilistic models, meaning inferences are conditional on both the 
phylogenetic tree and the chosen model. Selecting a good model is there- 
fore essential for making robust inferences. Model choice should be guided 
by two considerations. First, does the model capture the processes relevant 

55 to the question (Hansen and Orzack, 2005; Maddison, 2006; Hansen and 
Bartoszek, 2012; Pennell et al., 2013)? Second, does the model provide a 
good statistical explanation for the data? Though there is interplay be- 
tween these two questions (Hansen and Bartoszek, 2012), we focus on the 
latter in this paper. In phylogenetic comparative biology, this question has 

so been addressed by comparing the relative fit of a number of models and 
selecting the best one for inference (Mooers et al., 1999; Harmon et al., 2010; 
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Hunt, 2012). 

However, the relative support for a model is only a partial answer to 
the question of whether it is a good explanation for the data, as the best of 

65 a poor set of models is still a poor model. We also must consider whether 
the model is a good fit, or adequate, in absolute terms. Assessing model 
adequacy is a routine step in many statistical applications, especially in 
Bayesian statistics (Gelman et al., 2003). Failure to assess model adequacy 
can potentially result in erroneous inferences. A classic example of this 

70 in molecular systematics is the disputed phylogenetic relationships of ro- 
dents. A number of early molecular phylogenetic studies reported evi- 
dence that "strongly contradict[ed]" the traditional hypothesis that the or- 
der Rodentia is a monophyletic group (Graur et al., 1991; D'Erchia et al., 
1996). Sullivan and Swofford (1997) demonstrated that these conclusions 

75 were misleading, and resulted entirely from using models of molecular 
evolution that did not adequately accommodate variation in substitution 
rates across sites (also see Brown, 2014). There are many similar case- 
studies throughout and outside of evolutionary biology. 

Many models for describing the evolution of phenotypic traits along 
so a phylogeny have been developed, as well as sophisticated machinery for 
fitting them to comparative data (O'Meara, 2012). Using these models, 
researchers from across biological disciplines have made discoveries that 
would not have been possible without a phylogenetic perspective. How- 
ever, it is disconcerting that we often have no idea if the models used in 
85 PCMs are adequate — there are no general procedures that can be used to 
assess this. Even more troubling is that model adequacy is rarely ever con- 
sidered when PCMs are applied. To borrow Rumsfeld's taxonomy, model 
adequacy has largely been an "unknown unknown" in comparative biol- 
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°gy- 

90 In this paper we seek to help rectify this situation. We address two 
major outstanding problems in comparative biology. First, we develop a 
general framework for assessing the adequacy of phylogenetic models of 
trait evolution, specifically those for quantitative characters. Second, we 
assess the adequacy of commonly used models using a recently published 

95 phylogeny of Angiosperms (flowering plants) (Zanne et al., 2013) and data 
compiled from the literature on three important functional traits: specific 
leaf area, seed mass, and leaf nitrogen content. We did this to test whether 
the simple models used in comparative biology provide a reasonable de- 
scription of the macroevolutionary dynamics of real trait data, across many 

100 different groups and time scales. 

A general framework for assessing model adequacy 

Though the number of models used in comparative biology is quite large, 
most of these fall into a relatively small set of classes (O'Meara, 2012). In 
this paper, we focus on one of these classes — models that describe the evo- 
105 lution of a single continuously valued trait. More specifically, our approach 
works for models that assume that trait values at the tips are multivariate 
normal; this applies to most models of quantitative trait evolution that have 
been developed to date (O'Meara, 2012). 

If we have a phylogenetic tree consisting of n lineages and data on the 
no trait values observed at each tip X (X = x\, xj_, . . . ,x n ), we can fit a model 
M. with parameters 6 to describe the pattern of trait evolution along the 
phylogeny. Most analyses using comparative data aim to answer one of the 
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following questions: what values of 9 best explain X given Ml; or, does 
M.\ explain the data better than M.q1 Our approach is conceptually distinct 
ii 5 in that we want to ask, how likely is it that model A4 with parameters 9 
would produce a dataset similar to X if we re-ran evolution? 

There are two primary ways of fitting models to comparative data. 
The first is use maximum likelihood (ML), restricted maximum likelihood 
(REML), least-squares, etc. to obtain a point estimate of 9. The second is to 
120 estimate posterior probability distribution Pr(#|X, Ai) using Bayesian ap- 
proaches. For the models used in comparative biology, estimating Pr(0|X, M.) 
requires using Markov chain Monte Carlo (MCMC) machinery to sample 
values of 9. 

While likelihood and Bayesian approaches to model-fitting are philo- 
125 sophically different from one another, in practice, our approach to assess- 
ing model adequacy is much the same for both: i) fit the model of trait 
evolution; 2) calculate a set of summary statitics on the observed data Sx; 
3) simulate many new datasets Y\, Y2, . . . , Y m under the model using the es- 
timated parameters; 4) calculate summary statistics on the simulated data 
130 Sy r \, <Sy,2/ • • • / ^Y,m'r 5) compare Sx to the distribution of Sy. If Sx devi- 
ates significantly from the distribution of Sy, we can reject the model as 
inadequate (see figure 1). 

If we have a point estimate of the model parameters 9, we simulate 
Y\, Y2, . . . , Y m on the phylogeny accoridng to 9 and M.. We then compare 
135 a single set of summary statistics Sx calculated from our observed data to 
the distribution of values for Sy computed across all m simulated datasets. 
In statistical terminology, this procedure is known as parametric bootstrap- 
ping. Parametric bootstrapping is likely familiar to phylogenetic biologists 
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in the form of the Goldman-Cox test (Goldman, 1993) for assessing the 
adequacy of sequence evolution models. 

If we have a posterior probability distribution Pr(9\X, Ai), we can as- 
sess model adequacy using posterior predictive simulation (Rubin, 1984; 
Gelman et al., 1996). We obtain new datasets by sampling from a second 
distribution, the posterior predictive distribution 

Pt(Y\X,M) = J Fr(Y\e,M)Pr(6\X,M)de (1) 

where Pr(Y|X, M.) is the probability of a new dataset Y given X and A4, av- 
eraged over the distribution of the parameters. In practice, sampling from 
Pr(Y|X, Jvi) entails drawing samples from the joint posterior distribution 
of parameters Pr(#|X, Jvi ) and simulating data on the phylogeny according 
to the sampled parameter values. Therefore, the datasets Y\, Yi, . . . , Y m are 
each generated from different values of 9. Posterior predictive simulation 
approaches have been previously developed for models in molecular phy- 
logenetics (Bollback, 2002; Reid et al., 2013; Lewis et al., 2013; Brown, 2014), 
and recently for PCMs (Slater and Pennell, 2013), but have not been widely 
adopted in either field. 

If the chosen summary statistics were properties of the data alone, such 
as the mean of observed trait values at the tips, we would have a single 
estimate of Sx for both likelihood and Bayesian approaches. However, in 
our approach for assessing model adequacy, the summary statistics depend 
on both the data and the parameter estimates, for reasons detailed below. 
Therefore for the likelihood case, we have a single set of observed summary 
statistics Sx based on the data and the point estimate of model parameters 
6. For the Bayesian approach we have a distribution of observed summary 
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statistics Sx,i,Sx,2, ■ ■ ■ ,Sx,m> each Sx,i calculated using the same data but 
a different set of parameter values sampled from Pr(0|X, M). In this case, 
we compare the distribution of values of Sx to the distribution of values of 

Sy. 

Summary statistics 

No simulated dataset will ever be exactly the same as our observed dataset. 
We therefore need to choose informative summary statistics in order to 
evaluate whether the model predicts datasets that are similar to our ob- 
served dataset in meaningful ways. As an example, consider the case of 
regular (i.e., non-phylogenetic) logistic regression. The goal is to fit a 
model that predicts the state (o or i) of the dependent variable. We can 
assess the adequacy of such a model by simulating many datasets under 
the fitted parameters. One good summary statistic for this type of model 
is the proportion of values that are in state o. We can calculate this di- 
rectly from the observed data and from each of the simulated datasets and 
compare the observed value of this summary statistic to the distribution of 
values from the simulated datasets. This works because in a simple logistic 
regression model, we assume that the different values for the responses are 
independent of each other. However, the states at the tips of the phylogeny 
are not independent — this is why we are using PCMs in the first place! — 
and thus calculating summary statistics on the data directly is not generally 
informative for models in comparative biology. 

We account for the non-independence of the observed data by calcu- 
lating summary statistics on the set of contrasts (i.e., "phylogenetically 
independent contrasts", sensu Felsenstein, 1985) computed at each node. 
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(We refer readers to Felsenstein, 1985; Rohlf, 2001; Blomberg et al., 2012, 
for details on how contrasts are calculated.) Under Brownian motion (BM) 
the contrasts will be i.i.d. ~ J\f(0,cr), where a 1 is the BM rate parame- 
ter (Felsenstein, 1985). This i.i.d. condition allows us to perform standard 
statistical tests on the contrasts. 

The choice of what summary statistics to use for assessing model ade- 
quacy is ultimately one of balancing statistical intuition and computational 
effort. We have chosen the following set of six summary statistics to com- 
pute on the contrasts because they capture a range of possible model vio- 
lations and have well-understood statistical properties. All of these essen- 
tially evaluate whether the contrasts come from the distribution expected 
under BM. 

Mpic The mean of the squared contrasts. This is equivalent to the REML 
estimator of the Brownian motion rate parameter a 2 (Garland et al., 
1992; Rohlf, 2001). Mpic is a metric of overall rate. Violations detected 
by Mpic indicate whether the overall rate of trait evolution is over- or 
underestimated. 

VpiQ The coefficient of variation (standard deviation/ mean) for the abso- 
lute value of the contrasts. If Vpic calculated from the observed con- 
trasts is greater than that calculated from the simulated contrasts, it 
suggests that we are not properly accounting for rate heterogeneity 
across the phylogeny. If Vpic from the observed is smaller, it suggests 
that contrasts are more even than the model assumes. We use the 
coefficient of variation rather than the variance because the mean and 
variance of contrasts can be highly correlated. 

Svar The slope resulting from fitting a linear model to the absolute value 
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of the contrasts vs. their expected variances. Each (standardized) con- 
trast has an expected variance proportional to the sum of the branch 
lengths connecting the node at which it is computed to its daughter 
lineages (Felsenstein, 1985). Under a model of BM, we expect no re- 
lationship between the contrasts and their variances. We use Svar to 
test if contrasts are larger or smaller than we expect based on their 
branch lengths. If, for example, more evolution occurred per unit 
time on short branches than long branches, we would observe a neg- 
ative slope. 

Sanc Th e slope resulting from fitting a linear model to the absolute value 
of the contrasts vs. the inferred ancestral state at the correspond- 
ing node. We estimated the ancestral state using the least-squares 
method suggested by Felsenstein (1985) for the calculation of con- 
trasts. We note that this is not technically an ancestral state recon- 
struction (see Felsenstein, 1985); it is more properly thought of as 
a weighted average value for each node. We used this statistic to 
evaluate whether there is variation in rates relative to the trait value; 
for example, do larger organisms evolve proportionally faster than 
smaller ones? 

Shgt The slope resulting from fitting a linear model between the absolute 
value of the contrasts and vs. height of the node at which they are 
inferred measured from the root. This is used to capture variation 
relative to time. It is alternatively known as the "node-height test" 
and has been used to detect early bursts of trait evolution during 
adaptive radiations (Freckleton and Harvey, 2006; Slater and Pennell, 
2013). 

10 
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240 Dks The D-statistic obtained from Kolmolgorov-Smirnov test from com- 
paring the distribution of contrasts to that of a normal distribution 
with mean 0 and standard deviation equal to the root of the mean 
of squared contrasts (the expected distribution of the contrasts un- 
der BM; see Felsenstein, 1985; Rohlf, 2001). We chose this to cap- 

245 ture deviations from normality. For example, if traits evolved via a 

"jump-diffusion" type process (Landis et al., 2013), in which there 
were occasional bursts of rapid phenotypic evolution (Pennell et al., 
2013), the tip data would no longer be multivariate normal owing to 
a few contrasts throughout the tree being much larger than the rest 

250 (i.e., the distribution of contrasts would have heavy tails). 

Alternative sets of summary statisics are certainly possible. One could, 
for instance, calculate the median of the squared contrasts, the skew of 
the distribution of contrasts, etc. If the generating model was known, we 
could use established procedures for selecting a set of sufficient (or, approx- 

255 imately sufficient; Joyce and Majoram, 2008) summary statistics for that 
model, as is typically done when computing likelihood ratio tests. How- 
ever, the aim of our approach is assess the fit of a proposed model without 
reference to a true model. Our summary statistics will detect many types of 
model misspecification but this does not mean that they will necessarily de- 

260 tect every type of model misspecification; researchers interested in specific 
questions are encouraged to explore alternate sets of summary statistics. 

Beyond Brownian motion 

All of the summary statistics are designed to evaluate the adequacy of a 
BM model of trait evolution. Our summary statistics SvaR/ Sano an d Shgt 
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have been used previously in the literature with this justification (Garland 
et al., 1992, 1993; Diaz-Uriarte and Garland, 1996). However, if we pro- 
pose a different model for the evolution of the trait, such as an Ornstein- 
Uhlenbeck (OU; Hansen, 1997) process, then the expected distribution of 
the contrasts is different. The expected distribution of contrasts under most 
models of trait evolution, aside from BM, is not formally characterized and 
even if it was, this would necessitate a specific set of summary statistics for 
every model proposed. 

Our solution to this problem is to create what we term a "unit tree", 
which is a phylogenetic tree transformation that captures the dynamics of 
trait change under a particular evolutionary model. More formally, for a 
particular evolutionary model A4 (with parameter values 9), we define a 
unit tree as a phylogenetic tree that has the following property: the length 
of branch i, v-, is equal to the amount of variance expected to accumulate 
over i under M.,6. The variance is standardized, such that the expected 
distribution of the trait data on the unit tree is equal to that of a Brownian 
Motion (BM) model with a rate a 2 equal to 1. 

One can construct a unit tree by recognizing that many models of evo- 
lution can be represented via transformations of the branch lengths of a 
phylogenetic tree (O'Meara, 2012; Ho and Ane, 2014), including multi- 
rate BM (O'Meara et al., 2006; Thomas et al., 2006; Eastman et al., 2011), 
mulit-optima OU (Butler and King, 2004; Beaulieu et al., 2012; Ingram 
and Mahler, 2013), and models in which the rate and/or process change 
through time (Blomberg et al., 2003; Slater, 2013). 

Specifically, we create unit trees by transforming branch lengths using 
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290 the following general formula: 

v[ = E[Cov(A,B)] -E[Cov(A,C)] (2) 

where v[ is the transformed length of a branch i, A and C are two lineages 
subtended by the node at the rootward end of i, and A and B are two 
lineages subtended by the tip ward end of i (see figure 2). One can choose 
any pair of species that fits this property and then transverse the phylogeny 

295 in any direction so that each branch in the tree is transformed. For terminal 
branches, the formula reduces to E[Var(A)] — E[Cov(A, B)]. Measurement 
error can be incorporated by simply lengthening the terminal branches of 
the unit tree by the estimated standard error. As an example, we illustrate 
how a unit tree is constructed from the parameters of a fitted OU model 

300 (figure 2). 

If the fitted model is adequate, the trait data at the tips of the unit tree 
will have the same distribution as data generated under a BM process with 
a rate of 1 and the contrasts will be ~ A/"(0, 1) — hence the name, unit 
tree. Creating the unit tree from the estimated model parameters prior to 

305 computing the contrasts generalizes the summary statistics to most models 
of quantitative trait evolution (but see Landis et al., 2013, for an exception). 
We also emphasize that because the contrasts are calculated on the unit 
tree, the summary statistics all must depend on both the data and the 
model — it is for this reason that the Bayesian version of our approach 

310 produces a distribution of observed summary statistics. 

Once we have created the unit tree from the estimated parameters, new 
datasets can be simulated under the model simply using a BM process 
with a 2 = 1. The distribution of summary statistics calculated on these 
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simulated data sets can then be compared to the summary statistics from 
the observed data. We summarize the entire approach in figure i. We have 
implemented our method in a new R package, arbutus (see below for more 
details). 

The adequacy of models for the evolution of plant 
functional traits 

Data 

We used a phylogeny of Angiosperms, containing 30,535 species, from a 
recent study by Zanne et al. (2013). We refer interested readers to the 
original publication for details on the phylogeny. For the purposes of this 
study, we conducted all analyses on the MLE of the phylogeny (available 
on DRYAD, doi:io.5o6i/dryad.63q27/3). 

We assembled large datasets on three functionally important plant traits: 
specific leaf area (SLA, defined as fresh area/ dry mass); seed mass; and 
leaf nitrogen content (% mass). Seed mass is a crucial part of species' 
life-history strategy (Leishman et al., 2000; Westoby et al., 2002) and SLA 
and leaf nitrogen content are important and widely measured components 
of species' carbon capture strategies (Wright et al., 2004). Understand- 
ing the macroevolutionary patterns of these three traits can provide key 
insights into the evolutionary processes that have shaped much of plant 
diversity (Cornwell et al., 2014). All data are previously published; see 
https : //github/richf itz/modeladequacy for specific locations and scripts 
to access and process the original data. The SLA and leaf nitrogen data 



Downloaded from http://biorxiv.org/ on September 18, 2014 



comes from Wright et al. (2004) with additional SLA data from the LEDA 
project (Kleyer et al., 2008). Seed mass data comes the Kew database (Royal 
Botanical Gardens, Kew, 2014). We used an approximate grepping ap- 
proach to find and correct spelling mistakes and synonymy tools from The 
Plant List (2014) to match the trait databases to the Zanne et al. phylogeny. 
The full data set includes 3293 species for SLA, of which 2200 match species 
in the Zanne et al. tree. For seed mass, the dataset included 22,817 species 
with 11,107 matched the phylogeny. For leaf nitrogen content, we have data 
for 1574 species with 936 included in the tree. All data was log transformed 
prior to analyses. 

Because the vast majority of the species are only represented by a sin- 
gle record, it was not possible to use a species-specific estimate of trait 
standard error (SE) to account for either measurement error or intraspe- 
cific variation. As an alternative, we estimated a single SE for each trait by 
calculating the mean standard deviation for all species for which we had 
multiple measurements. The assumption of a constant SE across all species 
is unlikely to hold up to closer examination, but even a somewhat inac- 
curate estimate of error is better than assuming none at all (Hansen and 
Bartoszek, 2012). 

Analysis 

We first matched our trait data to the whole phylogeny and then extracted 
subclades from this dataset in a three ways: 1) by family; 2) by order; and 
3) by cutting the tree at 50 my intervals and extracting the most inclusive 
clades (named or unnamed) for which the most recent common ancestor of 
a group was younger than the time-slice. (The crown age of Angiosperms 
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is estimated to be ^243 my in the MLE tree and the tree was cut at 50, 
100, 150, and 20omy.) We kept only subclades for which there was at least 
20 species present in both the phylogeny and trait data so that we had a 
365 reasonable ability to estimate parameters and distinguish between models 
(Boettiger et al., 2012; Slater and Pennell, 2013). For SLA, this left us with 
72 clades, seed mass, 226 clades, and leaf nitrogen content, 39 clades (337 
in total). We note that these datasets are not independent as many of the 
same taxa were included in family, order and multiple time-slice subtrees. 

370 Following Harmon et al. (2010), we considered three simple models of 
trait evolution: 1) BM, which can be associated with genetic drift (Lande, 
1976; Felsenstein, 1988; Lynch, 1990; Hansen and Martins, 1996), randomly- 
varying selection (Felsenstein, 1973), or the summation of many indepen- 
dent processes over macroevolutionary time (Hansen and Martins, 1996; 

375 Uyeda et al., 2011; Pennell et al., 2013); 2) single optimum OU, which is 
often assumed to represent stabilizing selection (following Lande, 1976), 
though we think a more meaningful interpretation is that it represents an 
"adaptive zone" (Hansen, 2012; Pennell and Harmon, 2013); and 3) EB (also 
known as ACDC), which was developed as a mathematical representation 

380 of a niche-filling process during an adaptive radiation (Blomberg et al., 
2003; Harmon et al., 2010). We fit each of these models to all 337 subclades 
in our dataset. We then used the approach we developed to assess the 
adequacy of each fitted model. 

All of the analyses conducted in this paper were conducted using both 
385 likelihood and Bayesian inference. We did to demonstrate the scope of our 
approach and because both ML and Bayesian inference are commonly used 
in comparative biology. We emphasize that our approach is not tied to any 
single statistical paradigm. 
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For the likelihood analyses, we fit the three models (BM, OU, and EB) 
using ML with the divers itree package (Fitzjohn, 2012). We calculated 
the AIC score for each model. We then constructed a unit tree for each 
subtree, trait and model combination using the maximum likelihood es- 
timates of the parameters. We calculated the six summary statistics de- 
scribed above (Mpic, Vpic, SvaR/ Sano Shgtv ^ks) on the contrasts of the 
data. We simulated 1000 datasets on each unit tree using a BM model 
with a 2 = 1 and calculated the summary statistics on the contrasts of each 
simulated data set. 

For the Bayesian analysis, we fit the same models as above using a 
MCMC approach, sampling parameter values using slice sampling (Neal, 
2003), as implemented in divers itree (Fitzjohn, 2012). For the BM model 
we set a broad uniform prior on a 2 ~ U[0,2], the upper bound being sub- 
stantially larger than the ML estimate of a 2 for any clade. For the OU 
model, we used the same prior for a 2 and drew a values, the strength 
of attraction to the optimum, from a Lognormal(log(0.5),log(1.5)) distri- 
bution. A complication involved in fitting OU models is deciding what 
assumptions to make about the state at the root Zrj. Here, we follow other 
authors (Butler and King, 2004; Beaulieu et al., 2012) and assume that Zrj is 
at the optimum. For the EB model, we again used the same prior for a 2 
and a uniform prior on a, the exponential rate of decrease in a 2 , such that 
a ~ U[— 1,0] (the minimum value is much less than we would typically 
expect; Slater and Pennell, 2013). 

Again, for each model/trait/subtree combination, we ran a Markov 
chain for 10,000 generations. Preliminary investigations demonstrated that 
this was more than sufficient to obtain convergence and proper mixing for 
these simple models. After removing a burn-in of 1000 generations, we 
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calculated the Deviance Information Criterion (DIC, a Bayesian analog of 
AIC; Spiegelhalter et al., 2002) for each model. We drew 1000 samples from 
the joint posterior distribution (again, after removing burn-in). For each 
of the sampled parameter sets, we used the parameter values to construct 
a unit tree and calculated our six summary statistics on the contrasts. We 
then simulated a dataset on the same unit tree and calculated the summary 
statistics on the contrasts of the simulated data. 

In the likelihood analyses, for each dataset, we had one set Sx of ob- 
served summary statistics and a 1000 sets Sy,l' ^Y,2> • • • > ^Y/LOOO or sum- 
mary statistics calculated on data simulated on the same unit tree. In 
the Bayesian version, we had 1000 sets of observed summary statistics 
Sx,\, Sx,i, • • • , <Sx,iooo using a different unit tree for each set and 1000 sets of 
simulated summary statistics «Sy,i, «Sy,2/ ■ ■ ■ / «5y,i000/ each Sy,i corresponding 
to the unit tree used to compute Sx,i- 

For both types of analyses, we report two-tailed p-values (i.e., the prob- 
ability that the observed that a simulated summary statistic was more ex- 
treme than the observed). As a multivariate measure of model adequacy, 
we calculated the Mahalanobis distance, a scale-invariant metric, between 
the observed summary statistics and the mean of our simulated summary 
statistics, taking into account the covariance structure between the sum- 
mary statistics. We took the log of the KS D-statistic, D^S/ as the Ma- 
hanalobis measure assumes data is multivariate normal and the D-statistic 
is bounded between o and 1. For the Bayesian analyses, we report the mean 
of the distribution of Mahalanobis distances. All analyses were conducted 
in R V3.0.2 (R Development Core Team, 2013). Scripts to reproduce all 
analyses are available at https://github.com/richfitz/modeladequacy. 
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A case study: seed mass evolution in the Meliaceae and Fa- 
gaceae 

As an illustration of our approach, we present a case study examining seed 
mass evolution in two tree families, the Meliaceae, the "mahogany family", 
and Fagaceae, which contains oaks, chestnuts and beech trees. The trait 
data and phylogeny for both groups are subsets of the larger dataset used 
in the analysis. Superficially, these datasets are quite similar. Both are of 
similar size (Meliaceae: 44 species in the dataset, 550 in the clade; Fagaceae: 
70 species in the dataset and 600 in the clade), age (crown age of Meliaceae: 
^53my; Fagaceae: ^4omy) and are ecologically comparable in terms of 
dispersal strategy and climatic niche. 

As described above, we fit three simple models of trait evolution (BM, 
OU, EB) to both datasets using ML and computed AIC weights (AIC W ; 
Akaike, 1974; Burnham and Anderson, 2004) for the three models. For 
both datasets, an OU model was overwhelmingly supported (AICa, > 0.97 
for both groups). Therefore, looking only at relative model support, we 
might conclude that similar evolutionary processes are important in these 
two clades of trees. 

Examining model adequacy provides a different perspective. We took 
the MLE of the parameters from the OU models for each dataset and 
constructed a unit tree based on those parameters. We calculated our 
six summary statistics on the contrasts of the data, then simulated 1000 
datasets on the unit tree and calculated the summary statistics on the con- 
trasts of each simulated dataset (figure 3). For seed mass evolution in 
Meliaceae, the OU model was an adequate model; all six observed sum- 
mary statistics were in the middle of the distribution of simulated sum- 
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mary statistics (M P i C : p = 0.420, V mc : p = 0.533, S V ar : P = 0.605, 
Sanc : P = 0.494, S H gt : P = 0.122, D K s : p = 0.677). In contrast, for Fa- 

47 o gaceae we found that the OU model was inadequate with Mpic, the REML 
estimate of a 2 was significantly lower than the expectation based on the 
model (p ~ 0), suggesting that the process of evolution that gave rise to 
this data was more complex that that captured by a simple OU process 
(we return to this in the Discussion). The rest of the observed summary 

475 statistics did not differ significantly from the simulated summary statistics 
(V PIC : p = 0.607, Svar : P = 0.188, S ANC : p = 0.404, S HG T : p = 0.883, 
Dks '■ p = 0.957). This example serves to illustrate the distinction between 
the conventional approach to model selection in PCMs and model ade- 
quacy. For both of these datasets, OU best supported model, but is only 

4 so adequate for one of them. Selecting amongst a limited pool of models does 
not give a complete picture of whether a model is a good explanation for 
the data. 

Results 

Despite the potential for differences in the adequacy of models fit with like- 
485 lihood versus Bayesian inference, we find the results to be broadly similar. 
For clarity and conciseness, we present only the results from the likeli- 
hood analyses here. Results from the Bayesian analysis are presented in 
the Supplemental Material. Full results from all analyses are available at 
https : //github . com/richf itz/modeladequacy. 

490 Across the 337 subclades, we found widespread support for OU models. 
For 236 of clades, OU had the highest A1C W . OU had ^100% of the A1C W 
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in 27 clades and >75% of the weight in 184 clades (figure 4). Similar to the 
analysis of Harmon et al. (Harmon et al., 2010) we found very little support 
for EB models (only 6 clades supported EB with >75% AIC W ), suggesting 
that "early bursts" of trait evolution may indeed be rare in comparative 
data (but see Slater and Pennell, 2013). Larger clades were likely to have 
high support for a single model (of the 101 clades consisting of more than 
100 taxa, 53 had >go of the AIC weight on a single model), and that was 
overwhelmingly likely to be an OU model (52/ 53 clades). 

We limit our analyses of model adequacy to only the most highly sup- 
ported model in the candidate set, as supported by AIC. We did this to 
present a best-case scenario; if a model had very little relative support, it 
would be unremarkable if it also had poor adequacy (but see Ripplinger 
and Sullivan, 2010). Even considering only the best of the set, in general, 
the models had strikingly poor adequacy (figure 5). Of the 72 comparative 
datasets of SLA, all 72 rejected the best model by at least one summary 
statistic (using a cut-off of p = 0.05), 33 by at least two, and 17 by three 
or more. Results were similar in the seed mass data (of the 226 seed mass 
datasets, 185 were rejected by at least one summary statistic, 128 by at least 
two and 74 by three or more) and leaf nitrogen content (of the 39 datasets, 
all 39 could be rejected by at least one, 24 by at least two, and 11 by three or 
more summary statistic). Some summary statistics were much more likely 
to detect model violations than others. The best model was rejected by 
Mpic in 250 datasets and by Vfoc in 174; the frequency of rejection is sub- 
stantially lower by the other summary statistics (Svar : 65, Sanc : 66, Shgt : 
49/ ^ks: 71)- Across all 337 datasets, only 41 are adequately modeled by ei- 
ther BM, OU or EB — all of these are seed mass datasets. This is extremely 
worrisome as these are the most commonly used models for quantitative 
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traits in comparative biology. 

As the subclades are not independent (overlapping sets of taxa are 
present in family, order and time-slice phylogenies), conventional statistics, 
such as linear regression, are not straightforward to apply across datasets. 
Nonetheless, the trend is clear: the larger the phylogeny, the more likely 
OU is to be highly supported and the more likely the model is to be inad- 
equate. There is a strong relationship between the size of a subclade and 
the overall distance between observed and simulated summary statistics, 
as measured by the Mahanalobis distance (figure 6). This is not simply 
an artifact of conducting the analyses using a larger number of contrasts 
for the summary statistics — if the model was adequate at all scales, there 
would be no relationship between the Mahalanobis distance and the size 
of the phylogeny. As stated above, larger clades also tended to support 
a single model, meaning that the datasets for which the best model had 
a very poor absolute fit also had the most substantial difference between 
the relative fits of the three models (figure Si). There was no relationship 
between clade age and model adequacy (figure S2). 

Discussion 

The distinction between relative and absolute fit is an important one. A 
model may provide the best explanation for a dataset compared to a few 
other models but still be a very poor explanation in terms of capturing 
the patterns of variation present in the data. To again draw an analogy 
with more conventional statistics, consider fitting a regular linear regres- 
sion model to a dataset. In this case, we can often identify if the model is 
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a good absolute fit simply by plotting the data. A number of distributions 
may produce similar results (i.e., intercept, slope, p-value, etc.) but our in- 

545 ferences are contingent upon the relationship being linear (for a classic case 
study, see Anscombe, 1973). For phylogenetic models of trait evolution, 
such simple diagnostic measures are usually not informative for assessing 
model adequacy due to the complex pattern of dependency between data 
points — therefore, alternate statistical procedures are necessary. Here we 

550 have developed a general tool for this purpose. 

In our analyses of the evolution of three key Angiosperm functional 
traits, the fact that a model — most often, OU — was highly supported rel- 
ative to the others, had little to do with the model's absolute explanatory 
power. Overall, the adequacy of the three simple, but commonly used, 

555 models of trait evolution was woefully poor (figure 5). Perhaps surpris- 
ingly, this was true at across all scales, though models were increasingly 
inadequate for larger clades (figure 6). Our results raise serious concerns 
about current practices in comparative biology. Relying on a single model, 
such as BM, or even the best among a small subset of models, for inference 

560 has the potential to greatly mislead inferences about the processes that have 
driven the evolution of traits at phylogenetic scales. 

Many researchers have expressed concern that the models used in com- 
parative biology are often inappropriate, for either biological or statistical 
reasons (Felsenstein, 1985, 1988; Harvey and Pagel, 1991; Garland et al., 
565 1992; Diaz-Uriarte and Garland, 1996; Hansen and Martins, 1996; Price, 
1997; Garland et al., 1999; Garland and Ives, 2000; Hansen and Orzack, 
2005; Hansen and Bartoszek, 2012; Felsenstein, 2012; Boettiger et al., 2012; 
Slater and Pennell, 2013). Indeed, many of these issues were raised in the 
early days of the field; we are far from the the first to point this out. In this 
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paper we have developed an approach to actually quantify when a model 
provides a poor explanation for the data and demonstrate that the concerns 
about model adequacy are just not theoretical but have real implications for 
data analysis and interpretation. 

The 337 comparative datasets we analyzed varied in terms of traits, 
size, age and placement in the Angiosperm phylogeny. Nonetheless, sev- 
eral general patterns emerge. An OU model, was by and large, the most 
supported of the three we examined. In an analysis of 67 comparative 
datasets consisting of size and shape data from a variety of animal taxa, 
Harmon et al. (Harmon et al., 2010) also found substantial support for OU 
models, though for their datasets, BM was more commonly chosen by AIC 
(we note, however, that many of their datasets were quite small; see Slater 
and Pennell, 2013). Since their paper, a number of studies conducted in a 
diverse array of groups have also found OU models to be preferred over 
BM models (e.g., Burbrink et al., 2012; Quintero and Wiens, 2013; Lopez- 
Fernandez et al., 2013). 

The tendency of OU to explain data better than BM has inspired diverse 
process-based explanations, including stabilizing selection, evolutionary 
constraints and the presence of "adaptive zones" (Hansen and Martins, 
1996; Butler and King, 2004; Hansen, 2012; Pennell and Harmon, 2013). If 
the widespread support for OU models was indeed caused by the biologi- 
cal processes that have been proposed, we would expect that an OU model 
would also be widely adequate. However, this is not what we found. OU 
models are inadequate with our summary statistics, most often with Mpic 
and Vpic but frequently with others as well. OU models often failed to 
capture other important types of heterogeneity — variation with respect to 
branch lengths (Svar)/ trait values (Sanc) an d time (Shgt)- Additionally, 
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a substantial number of datasets were not well-modeled by a multivariate 
normal distribution (Dks)- These results suggest a statistical explanation 
for the high support for OU models. OU predicts higher variance near the 

600 tips of the phylogeny than do BM or EB models (see figure 1 in Harmon 
et al., 2010). Heterogeneous evolutionary processes, phylogenetic misesti- 
mation and measurement error (Houle et al., 2011; Hansen and Bartoszek, 
2012) could also produce such a pattern. In light of our results from model 
adequacy, it seems likely that OU is widely supported because it is able to 

605 accommodate more "slop" than the other models. This is not to say that the 
processes captured by OU models are unimportant in macroevolution, but 
rather that OU models may be favored for reasons that are more statistical 
than biological. 

The way in which the observed summary statistics deviate from the 
610 simulated values also supports the claim that the widespread support for 
OU is largely a statistical artifact. Model violations were most frequently 
detected by the global rate estimate Mpic, in all cases Mpic calculated from 
the observed contrasts was less than the mean of Mpic calculated from the 
contrasts of the data simulated on the unit tree. If the evolutionary pro- 
615 cess (or, alternatively, phylogenetic /measurement error) is heterogeneous 
across the tree, the lineages in some parts of the clade will be much more 
divergent than in others. The only way for the model to account for the 
highly divergent groups is to estimate a large a 2 (and/ or a small a param- 
eter for the OU model). The unit tree formed by these parameter estimates 
620 will have long branches across the entire tree. In the less divergent parts 
of the tree, the contrasts calculated on this unit tree will be small, rela- 
tive to what we expect under BM. So perhaps counter-intuitvely, when 
heterogeneity in processes across taxa cause the estimated global rates of 
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divergence to be inflated, resulting in a lower value for Mpic- For sim- 
ilar reasons, values of Vpic calculated from the observed data tended to 
be larger than the simulated values, though not as consistently as Mpic is 
smaller. 

Returning to our case study, it is unclear why an OU model appears 
to be an adequate model of seed mass evolution in Meliaceae but not in 
Fagaceae (as detected by deviations in Mpic; figure 3). Both are woody 
lineages with large, usually vertebrate dispersed seeds (Pannell and Koziol, 
1987; Manos et al., 2001). One intriguing difference between the two clades 
is that some groups within Fagaceae have recently radiated, e.g., within 
the oaks (Quercus) (Simeone et al., 2013). The same ecological processes 
may be implicated in driving rapid speciation and heterogeneous patterns 
of trait evolution across the group (Schluter, 2000), but we do not have any 
evidence that this is the case. Alternatively, the poor adequacy may not 
be due to the features of the model per se but may indicate problems in 
the data. Failure to fully account for measurement error, topological and 
branch length errors, and "outlier lineages" (see Slater and Pennell, 2013) 
can all create a poor fit between the model and data, even if the model is 
generally capturing the pattern of trait evolution along the phylogeny. 

If a model is a poor fit, there are a number of ways forward. First, we 
recommend that researchers take a close look at their data. Has measure- 
ment error been appropriately incorporated into the analysis?; are a few 
contrasts much larger than the rest?; and if so, are these associated with 
poorly supported nodes in the phylogeny? Such simple diagnostics will 
likely prove to be informative in many cases. We suspect that for many 
of the clades in which the models were all rejected, the inadequacy of the 
model could be traced to idiosyncratic problems in the underlying datasets. 
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Second, researchers may need to incorporate additional biological realism 
into their models. There has been a great deal of progress recently towards 
developing more complex models of trait evolution, such as those allowing 
multiple rates (O'Meara et al., 2006; Thomas et al., 2006; Eastman et al., 

655 2011), multiple OU optima (Butler and King, 2004; Beaulieu et al., 2012; 
Ingram and Mahler, 2013) and combinations of processes through time 
(Slater, 2013). We chose the models we did for this analysis because they 
are commonly used and because we they allowed a direct and straight- 
forward comparison between different clades in our study as well as with 

660 other similiar studies (e.g., Harmon et al., 2010). We were certainly not so 
na'ive as to think they would be appropriate at very large scales; for exam- 
ple, we knew that a single rate BM model was unrealistic for the evolution 
of seed mass across all Angiosperms (Moles et al., 2005). Though the three 
models were widely rejected across our datasets, it is likely that models 

665 that include heterogeneous processes will provide good explanations for 
the evolution of these functional traits in many groups. The patterns of 
deviation between the observed and simulated summary statistics will of- 
ten reflect interesting biology and can help guide researchers as to what 
additional processes are likely to be important in their group and need to 

670 be incorporated in the model of trait evolution. And our approach can, 
in turn, be applied to these more complex models to evaluate if they are 
indeed adequate. 

To be clear, we are not advocating in this paper that the p-values ob- 
tained from simulations be used as a means of model selection; rather, 
675 we view model checking as a step that should follow model choice and 
fitting (see Gelman et al., 2003, ch. 6, for discussion). If a model is inade- 
quate, we recommend that researchers use the summary statistics to think 
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critically about what type of additional processes need to be considered, 
perhaps perform some sort of model selection on a new set of models, and 

680 then evaluate the adequacy of a newly proposed model. Simply increasing 
the complexity of the model until it is no longer rejected by the summary 
statistics is not advisable. Model selection criteria are designed to balance 
bias and variance (the variance of a model increases with the number of 
free parameters). We do not provide any mechanism to penalize the addi- 

685 tion of extra parameters. In molecular phylogenetics, it is clear that some 
form of model selection is crucial for making reliable inferences (Sullivan 
and Joyce, 2005; Ripplinger and Sullivan, 2008), but the relationship be- 
tween different model selection criteria and model adequacy measures is 
complex (Ripplinger and Sullivan, 2010; Boettiger et al., 2012). 

690 However, it may be possible to extend our approach with an eye to- 
wards model selection. Slater and Pennell (2013) developed their posterior 
predictive simulation approach (which is related to our method) to distin- 
guish between a BM model and one where rates of evolution decreased 
through time. They chose summary statistics (including the node-height 

695 test Freckleton and Harvey, 2006, Shgt is our study) specifically to address 
this question. Slater and Pennell found using posterior predictive fit as a 
model selection criterion to be much more powerful than comparing mod- 
els using AIC or likelihood ratio tests, particularly when "outlier taxa" 
(lineages where the pattern of evolution deviates from the overall model) 

700 were included in the analysis. The logic of Slater and Pennell could be 
extended to other scenarios; to test some evolutionary hypotheses, we may 
care a lot about whether a model explains varation along some axes but be 
less concerned about others. This is a question-specific approach to model 
selection and has been developed in the context of molecular phylogenetics 
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(Bollback, 2002; Lewis et al., 2013). This is also the essence of the Decision- 
Theoretic approach to model selection (Robert, 2007), which has also been 
well-used in phylogenetics (Minin et al., 2003), but has not previously been 
considered in PCMs. 

There are a number of additional ways our approach could be extended. 
First, we have only considered a limited set of summary statistics. We chose 
them because each of these has a clear statistical expectation and observed 
deviations from them have intuitive biological explanations. However, they 
are certainly a subset of all possible summary statistics that could be ap- 
plied. For example, because contrasts are i.i.d., there should be no auto- 
correlation between neighboring contrasts; the summary statistics could be 
expanded to detect non-zero autocorrelation. Second, our approach can be 
applied equally well to phylogenetic regression models, such as phyloge- 
netic generalized least squares (Grafen, 1989) or phylogenetic mixed mod- 
els (Lynch, 1991; Hadfield and Nakagawa, 2010), where concerns regard- 
ing model adequacy are just as pertinent (Hansen and Bartoszek, 2012). In 
most phylogenetic regression models, the trait model describes the pattern 
of covariance between the residuals rather than the data (Rohlf, 2001, 2006; 
Freckleton et al., 2011) (but see Hansen et al., 2008). If we form the unit 
tree from estimated model parameters and calculate the summary statistics 
on the contrasts of the residuals, we can apply our approach. While our 
approach can be used to assess the adequacy of the phylogenetic compo- 
nent of regression models "out of the box", additional steps are required to 
assess the adequacy of the linear component. Third, our method was de- 
signed for quantitative trait models that assume data can be modeled with 
a multivariate normal distribution. We need model adequacy approaches 
for other types of traits (e.g., binary, multistate, ordinal) and for quantita- 
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tive models that do not predict a multivariate normal distribution of data 
(Landis et al., 2013). 

Arbutus 

735 We have implemented our approach in a new R package arbutus. It 
is available on github https://github.com/mwpennell/arbutus. For this 
project, we have also adopted code from the ape (Paradis et al., 2004), 
geiger (Pennell et al., in press), diversitree (Fitzjohn, 2012) and ggplot2 
(Wickham, 2009) libraries. We have written functions to parse the output 

740 of a number of different programs for fitting trait evolution models (see the 
arbutus website for an up-to-date list of supported models and packages). 
As this approach was developed to be general, we have written the code in 
such a way that users can include their own summary statistics and trait 
models in the analyses; we include demonstrations of how this can be done 

745 on the project website. 

Concluding remarks 

Attempts to assess the adequacy of phylogenetic models are almost as old 
as modern comparative phylogenetic biology. In the 1980s and 1990s much 
discussion surrounded the appropriateness of various methods and mod- 
750 els (Felsenstein, 1985, 1988; Harvey and Pagel, 1991; Garland et al., 1992; 
Diaz-Uriarte and Garland, 1996; Price, 1997; Garland et al., 1999; Garland 
and Ives, 2000). Ironically, as PCMs have become more widely adopted, 
criticism of modeling assumptions has waned (but see Losos, 2011; Felsen- 
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stein, 2012; Hansen and Bartoszek, 2012, for recent discussions). In our 
analysis of Angiosperm functional traits, we demonstrate that these con- 
cerns regarding model inadequacy are not just theoretical. Commonly used 
models in comparative biology were extremely poor explanations for the 
data; this was true both across the tree and through time. The recent de- 
velopment of models that incorporate different types of heterogeneity is 
encouraging — invoking increased complexity of processes and patterns 
will likely be necessary for making robust evolutionary inferences from 
comparative data. Evaluating the fit of macroevolutionary models should 
be an important component of any comparative analysis. We hope that our 
study can help move statistical model adequacy from being an "unknown 
unknown" to being a "known known", or at least a "known unknown". 
That is to say, we may not know exactly why our model is not capturing 
the variation in the data, but at least we will know it is not. 
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045 Results from Bayesian analyses 

As with the likelihood results (described in main text), OU models were 
highly supported across many datasets; 178/337 clades had the highest 
DIC weight (DICa;) on an OU model; 157 of them with greater than 75% of 
the total DIQy (see figure S3). While a generally similar pattern of model 

1050 support holds for both likelihood and Bayesian inference, the likelihood 
analyses are much cleaner (compare figure 4 and figure S3). This differ- 
nce can be explained by the fact that there is a tight statistical relationship 
between the AIC values for these three models. If two models have iden- 
tical likelihoods, the AIC scores, defined as — 2jC + 2k [where C is the log- 

1055 likelihood of the model and k is the number of parameters] will differ by 2. 
As BM is a special case of both OU and EB, in opposite directions in model 
space, the highest AIQ„ possible for BM is ^0.731). The rare clades were 
both OU and EB have higher support than BM likely reflect problems in 
optimization.) Calculating DIC values from posterior samples is inherently 

1060 more stochastic; if there is little information in data, the best DIC model 
will depend on the values sampled by the chain. 

For the model adequacy results, the results were also very similar to 
that of the likelihood analyses (compare to Results section in the main 
text). The adequacy of these simple models was woefully poor across most 
1065 of the datasets (figure S4). Again, we limit our analyses of model adequacy 
to only the most highly supported model in the candidate set. 

Of the 72 comparative datasets of SLA, the best supported model was 
rejected by at least one summary statistic in all 72 cases, 31 by at least two, 
and 17 by three or more. For the seed mass data, the model was rejected 
1070 by one or mor of the summary statistics in 185 datasets (by two or more in 
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128 datasets and by at least three in 75 cases). All 39 leaf nitrogen datasets 
again rejected the best supported model with at least one summary statistic 
(18 by at least two and 7 by at least three). 

Also, similar to the likelihood analyses, the frequency of rejection dif- 
1075 fered between the summary statistic. Mpic rejected the model in 246 datasets, 
and Vp IC in 169 (S VAR : 76, S ANC : 70, S HGT : 36, D KS : 68). Again, only 41 
datasets were adequately modeled by one of the three models in our candi- 
date set. There was a strong relationship between model (in)adequacy and 
clade size (figure S5), but not for clade age (figure S6). 
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Figure 1: Schematic diagram representing our approach for assessing 
model adequacy, i) Fit a model of trait evolution to the data; 2) use the 
estimated model parameters to build a unit tree; 3) compute the contrasts 
from the data on the unit tree and calculate a set of summary statistics 
Sx', 4) simulate a large number of datasets on the unit tree, using a BM 
model with a 2 = 1; 5) calculate the summary statistics on the contrasts 
of each simulated dataset <Sy; and 6) compare the observed and simulated 
summary statistics. If the observed summary statistic lies in the tails of 
the distribution of simulated summary statistics the model can be rejected 
as inadequate. The rotational circle in the center of the diagram indicates 
that assessing model adequacy is an iterative process. If a model is rejected 
as inadequate, the next step is to propose a new model and repeat the 
procedure. 
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Figure 2: Illustration of how a unit tree is constructed from model pa- 
rameters. The original phylogenetic tree (in red) can be represented by a 
variance-covariance (vcv) matrix C. The elements C m , M are the shared path- 
length from the root to the most recent common ancestor of m and n. The 
diagonal elements (m = n) are simply the total distance from the root to 
the tips. Given a model M. and parameters 6, the expected vcv of the trait 
values is described by a second matrix E. In the case of an OU model, the 

elements of E can be calculated as E m , M = ^e~ 2a ( Tmax ~ Cm - n \l — e ~ 2lxCm ' n ), 
where T max is the depth of the tree (Hansen, 1997). We can then use the 
fitted parameter values (in this example, a 2 = 0.5 and a, the strength of 
attraction towards the optimum, is 1), and equation 2 to construct the unit 
tree (blue). Focusing on a single branch i, the transformed branch length 
v- = ~La,b ~ ^a,c- Note that not only have the branch lengths changed 
relative to one another but the total tree depth T has decreased as well. 
While the original tree has branch lengths in units of time, the unit tree has 
branch lengths in units of expected (standardized) variance. 
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Figure 3: Illustration of our approach to model adequacy. We fit three mod- 
els (BM, OU, and EB) to seed mass data from two different tree families, the 
Meliaceae and the Fagaceae. In both cases, an OU model (analyzed here) 
was strongly supported when fit with ML. The plotted distributions are 
the summary statistics (Mpic, Vpic> Svar/ Sano ShgT/ C'ks) calculated from 
the contrasts of the simulated data; the bars underneath the plots represent 
95% of the density. The dashed vertical lines are the values of the sum- 
mary statistics calculated on the contrasts of the observed data. Using our 
summary statistics, an OU model appears to be an adequate model for the 
evolution of seed mass in the Meliaceae; for all of the summary statistics, 
the observed summary statistic lies in the middle of the distribution of sim- 
ulated summary statistics. For the Fagaceae, the rate estimate Mpic from 
the observed data is much lower that the rate estimate calculated on the 
simulated datasets. We can therefore reject an OU model as inadequate for 
this group (see text for details). 
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Figure 4: The relative support, as measured by AIC weight, for the three 
models used in our study (BM, OU, and EB) across all 337 datasets. An 
OU model is highly supported for a majority of the datasets. 
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Figure 5: The distribution of p-values for our six summary statistics over all 
337 datasets in our study after fitting the models using ML. The p-values 
are from applying our model adequacy approach to the best supported of 
the three models (as evaluated with AIC). For both the rate estimate Mpic 
and the coefficient of variation VpiC/ the vast majority of datasets would 
reject the best of the three models (at p < 0.05). 
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Figure 6: The relationship between clade size and a multivariate measure 
of model adequacy. The Mahalanobis distance is a scale-invariant metric 
that measures the distance between the observed and simulated summary 
statistics, taking into account the covariance between summary statistics. 
The greater the Mahalanobis distance, the worse the model captures varia- 
tion in the data. Considering only the best supported model for each clade 
(as chosen by AIC), there is a striking relationship between the two — the 
larger the dataset, the worse the models performed (note the logarithmic 
scale). If the models were equally likely to be adequate at all scales, we 
would expect no relationship. 
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Figure Si: The relationship between relative and absolute fit. For every 
clade for which a more complex model (OU, EB) was favored over BM 
using AIC, the Mahalanobis distance between the observed summary and 
simulated summary statistics is plotted against the improvement in AIC 
for the more complex model compared to BM. (Note that as all AIC val- 
ues were negative, larger differences mean greater relative support). The 
greater the relative fit of a more complex model, the more likely the model 
was to be inadequate. This result in primarily driven by clade size but 
serves to emphasize the distinction between relative and absolute fit. 
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Figure S2: The relationship between clade age and a multivariate mea- 
sure of model adequacy. Considering only the best supported of the three 
models (as selected by AIC, after fitting the models using ML), there is no 
apparent relationship between the age of clade and the distance of the ob- 
served and simulated summary statistics, as measured by the Mahalanobis 
distance. Contrast this figure with figure 6, which demonstrates a very 
tight relationship between clade size and model inadequacy. 
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Figure S3: The relative support, as measured by DIC weight, for the three 
models used in our study (BM, OU, and EB) across all 337 datasets. All 
models were fit with MCMC. Like the model comparisons done with AIC, 
an OU model is highly supported for a majority of the datasets. 
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Figure S4: The distribution of p-values for our six summary statistics over 
all 337 datasets in our study after fitting the models using MCMC. The 
p-values are from applying our model adequacy approach to the best sup- 
ported of the three models (as evaluated with DIC). For both the rate esti- 
mate Mpic and the coefficient of variation Vpic, the vast majority of datasets 
would reject the best of the three models (at p < 0.05). 
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Figure S5: The relationship between clade size and a multivariate measure 
of model adequacy from the Bayesian analysis. The Mahalanobis distance 
is a scale-invariant metric that measures the distance between the observed 
and simulated summary statistics, taking into account the covariance be- 
tween summary statistics. The greater the Mahalanobis distance, the worse 
the model captures variation in the data. Considering only the best sup- 
ported model for each clade (as chosen by DIC), there is a striking rela- 
tionship between the two — the larger the dataset, the worse the models 
performed (note the logarithmic scale). If the models were equally likely 
to be adequate at all scales, we would expect no relationship. 
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Figure S6: The relationship between clade age and a multivariate measure 
of model adequacy. Considering only the best supported of the three mod- 
els (as selected by AIC, after fitting the models using MCMC), there is no 
apparent relationship between the age of clade and the distance of the ob- 
served and simulated summary statistics, as measured by the Mahalanobis 
distance. Contrast this figure with figure S5, which demonstrates a very 
tight relationship between clade size and model inadequacy. 
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